Note: All bolded, underlined words are hyperlinks. In this project, I attempt to find the climaxes of the series Malazan Book of the Fallen using network, text, and sentiment analysis. The series is notable as one of the longest and most complex fantasy series with a continuous single plot line: it contains 3,252,031 words, and I estimate at least 1,314 characters and approximately 457 unique points of view. Additionally, many of the characters have multiple aliases and nicknames, adding another layer of complexity; a character might go by completely different names in different novels.
I began by text mining the co-occurrence data from the books; I’ll elaborate more on that process later. From there, I had to get the co-occurrence data into a reasonable format and clean the name data. I then used the AFINN Lexicon to get the sentiment data. Finally, I compared the sentiment data with the network data to see which works better for determining the climax of each of the 10 books.
The AFINN Lexicon contains 2,476 words with negativity and positivity scores between -5 and 5.
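AFINN scoring amounts to summing the lexicon scores of every matched word in a passage. Here is a minimal sketch of that idea; the tiny dictionary below is a stand-in for illustration, not the real 2,476-word lexicon.

```python
# Stand-in for the AFINN lexicon (illustrative entries only; the real
# lexicon scores 2,476 words from -5 to 5).
afinn = {"abandon": -2, "triumph": 4, "slaughter": -4, "hope": 2}

def sentence_score(text: str) -> int:
    """Sum the AFINN scores of every lexicon word in the text."""
    return sum(afinn.get(word, 0) for word in text.lower().split())

print(sentence_score("A triumph of hope over slaughter"))  # 4 + 2 - 4 = 2
```

In practice the scores would be aggregated per chapter (or per n-gram window) before being compared against the network data.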
Numerous datasets from several sources were used in this project. I mined the co-occurrence data from the books myself, after converting them from .epub to .txt format. I created most of the alias data manually and crowd-sourced some of the aliases on Reddit. The name data was extracted from the Dramatis Personae sections at the start of each book and manually extracted from the Malazan Wiki.
I began by joining the character name data and the alias data into a single dataset. I then split the names on spaces to get partial variations and rejoined the partial names back to the full names, producing a comprehensive dataset of full and partial names. Next, I filtered out stop words, formal titles, military ranks, and commonly capitalized words that aren’t names. Finally, I arranged the list by string length.
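The list-building step above can be sketched roughly as follows. The names and stop words here are illustrative, and I am assuming the length ordering is longest-first (so that full names can be matched before the partial names they contain):

```python
# Sketch of building the full/partial name list (toy data, not the
# project's real 3,080-name list).
names = ["Ben Adaephon Delat", "Tavore Paran", "Hood"]
stop_words = {"the", "of", "high"}  # plus titles, ranks, etc.

# Keep the full names and add each space-separated part as a partial name.
name_list = set(names)
for full in names:
    name_list.update(full.split())

# Drop stop words / non-name tokens, then sort longest-first so longer
# names are tried before their substrings during extraction.
name_list = sorted(
    (n for n in name_list if n.lower() not in stop_words),
    key=len, reverse=True,
)
print(name_list[0])  # longest entry: "Ben Adaephon Delat"
```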
At this point, I went back to the book data and turned the text into ngrams of length 20, by book and chapter. I chose ngrams so that I would capture the full co-occurrence relations within each 20-word window; for example, a sliding window yields “Ron Jon”, then “Ron Jon Bob”, and finally “Jon Bob”. I then wrote a function that tries to extract every name in the name list from each ngram. When there is a match, the function also removes the matched text from the string for subsequent iterations; otherwise, “Brys Beddict” would be extracted three times (once as “Brys Beddict”, once as “Brys”, and once as “Beddict”). Some of the code for this process is shown in the appendix.
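The match-then-remove logic can be sketched as below. This is a simplified stand-in for the actual extraction function, assuming the name list is sorted longest-first:

```python
# Greedy name extraction: longer names are tried first, and each match
# is blanked out so shorter aliases inside it are not matched again.
name_list = ["Brys Beddict", "Beddict", "Brys"]  # illustrative, longest-first

def extract_names(ngram: str) -> list[str]:
    """Return each matched name once, removing it from the string so
    substrings of the match are skipped on subsequent iterations."""
    found = []
    for name in name_list:
        if name in ngram:
            found.append(name)
            ngram = ngram.replace(name, " ")
    return found

print(extract_names("then Brys Beddict spoke"))  # ['Brys Beddict']
```

Without the removal step, the same window would yield “Brys Beddict”, “Brys”, and “Beddict” as three separate hits.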
This method of extraction was computationally intensive, as there were 3,080 names in my name list and 3,250,530 ngrams. After several iterations of my code, I got it to run in around 40 hours on my laptop: the original speed was about 310 seconds per 1,600 ngrams, which I reduced to approximately 70 seconds per 1,600.
Once the co-occurrence data was in a usable format, I moved on to data cleaning. Much of the cleaning was needed because the ngram approach produced partial name matches (“John Smith” matched as “John”) and because some characters have up to 9 different names. Using regular expressions, I formatted and removed variations of all names with over 100 appearances in the network data. I made three assumptions at this stage. (Note: when I use the word “importance,” I am generally talking about something with a high centrality measure.)
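The regex-based alias collapsing might look something like this. The pattern and mapping below are illustrative, not the actual cleaning rules used in the project:

```python
import re

# Collapse alias variants onto one canonical name.
# "Quick Ben" is a real alias of Ben Adaephon Delat, but this mapping
# is a hypothetical example of the cleaning rules, not the real set.
alias_patterns = {
    re.compile(r"\b(Quick Ben|Ben Adaephon Delat|Adaephon)\b"): "Ben Adaephon Delat",
}

def canonicalize(name: str) -> str:
    """Return the canonical form of a name, or the name unchanged."""
    for pattern, canonical in alias_patterns.items():
        if pattern.search(name):
            return canonical
    return name

print(canonicalize("Quick Ben"))  # 'Ben Adaephon Delat'
```

Applying a pass like this across the co-occurrence edges merges each character's aliases into a single node before computing centrality.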
I’ll jump right into it by giving the top 10 most important characters by PageRank, compared with their degree-centrality rankings. I used only undirected network graphs for everything that follows.
| Character | Ranking - Degree Centrality | Ranking - PageRank |
|---|---|---|
| Tavore Paran | 1 | 1 |
| Ben Adaephon Delat | 2 | 2 |
| Hood | 5 | 3 |
| Ganoes Stabro Paran | 4 | 4 |
| Fiddler | 3 | 5 |
| Kalam Mekhar | 6 | 6 |
| Whiskeyjack | 8 | 7 |
| Rhulad Sengar | 10 | 8 |
| Gesler | 7 | 9 |
| Anomander Rake | 14 | 10 |
Degree centrality and PageRank mostly agree on the most important characters in the series, but they diverge sharply further down the rankings, as shown in the figure to the right.
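The two measures can be computed and compared with networkx. This is a toy sketch on a few invented, weighted co-occurrence edges, not the real series network:

```python
import networkx as nx

# Toy undirected co-occurrence graph; edge weights stand in for
# co-occurrence counts (illustrative values only).
G = nx.Graph()
G.add_weighted_edges_from([
    ("Tavore Paran", "Fiddler", 120),
    ("Tavore Paran", "Ben Adaephon Delat", 95),
    ("Tavore Paran", "Hood", 10),
    ("Fiddler", "Hood", 40),
])

degree = nx.degree_centrality(G)          # ignores weights
pagerank = nx.pagerank(G, weight="weight")  # uses co-occurrence counts

# Rank characters under each measure, most important first.
by_degree = sorted(degree, key=degree.get, reverse=True)
by_pagerank = sorted(pagerank, key=pagerank.get, reverse=True)
print(by_degree[0], by_pagerank[0])
```

Comparing the two ordered lists (e.g. with a rank-correlation measure) makes it easy to see where the measures agree and where they diverge.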
For the following network graph, I decided to use PageRank after experimenting with other centrality measures.